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Abstract. The unit cost model is both convenient and largely realistic for 
describing integer decision algorithms over +, x. Additional operations like 
division with remainder or bitwise conjunction, although equally supported 
by computing hardware, may lead to a considerable drop in complexity. We 
show a variety of concrete problems to benefit from such non-arithmetic 
primitives by presenting and analyzing corresponding fast algorithms. 

1 Introduction 

The Turing machine is generally accepted as the appropriate model for describing 
both the capabilities (computability) and the complexity (bit cost) of calculations 
on actual digital computers; but it is cumbersome to handle when developing algo- 
rithms (upper complexity bounds) as well as for proving lower bounds, and therefore 
often replaced by algebraic models such as the random access machine (RAM). The 
latter operates on entire integers (as opposed to bits) and comes in various flavors 
depending on which primitives it is permitted to employ: e.g. incrementation, ad- 
dition, subtraction, multiplication, division, integer constants, bitwise conjunction, 
shifts "<— indirect addressing etc. Notice that both bitwise conjunction 
and integer division "div" (when the numerator is not a multiple of the denomi- 
nator) are non-arithmetic operations over Z yet commonly hardware supported by 
digital computers (see Section |4] below) . 

The choice among these instructions heavily affect a RAM's power in compari- 
son to the normative Turing machine; e.g. a decision based on polynomially many 
applications of (-I-, — , x)0 can, although possibly giving rise to exponentially long 
intermediate results, be simulated within RP Scho 79J; whereas polynomially many 
steps over (-I-, x,div) cover already NP |Scho79j : and over (+,— , x,&) even entire 
PSPACE |PSR74j . We are interested in the effect of these additional instructions 
to selected problems of complexity much lower than polynomial; specifically for 
accelerating to linear and sublinear running times as in the spirit of the following 

Exampl * *l 1 a) Over (+, — , x,div), not only primality test but even factoriza- 
tion of a given integer x is possible in time O{logx) linear in its binary length. 

b) Given a, fc G N and some arbitrary integer b > , one can compute over 
(-|-, — , X , div) within 0{-/k) steps. 

c) Over (+,— , X , div) and using indirect addressing, the greatest common divisor 
gcd(a;, y) of given integers can be calculated in OilogN / lo^ogN) steps, where 
N = max{x, j/}. 

d) Over (+,— ,&,^,^) (but without indirect addressing as for Bucket Sort), n 
given integers xi, . . . ,Xn can be sorted in 0{n); 

over {+, — , X, div) this can be achieved in 0{n ■ loglogmax^ Xi). 

* M. Ziegler is supported by DFG project Zil009/1-1 

** but no test for inequality, which in the sequel we shall implicitly permit 
** * We thank RiKO JACOB for pointing us to Items d) and e) in this example. 
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e) 3SUM, that is the question whether to given integers xi, x„, zi, z, 

there exist i,j,k with Xi + j/j = Zk, can be decided in 0{n) operations over 
(+,-,x,<§^). □ 

Otherwise, 3SUM is considered 'n^-complete' in a certain sense |GaOv95] . Regard- 
ing d), describing the permutation mapping the input to its sorted output requires 
f?(n • logn) bits. Similarly, compare c) with the running time ©(log A^) of the Eu- 
clidean algorithm attained on Fibonacci numbers x = Fn = N, y — Fn-i- And 
finally observe that in b) mere repeated squaring, i.e. without resorting to integer 
division, yields only running time 0{k); cf. Section [3.21 below. 

Pro of, a) Se e |Sham79j; b) see |BMST92| or Section below; c) see |Bsho89j : d) 
see [KiRe83| . and [Han04' for an account of more recent results on sorting using 
various sets of operations and costs. 

Claim e) can be concluded from (the much more general considerations in- 
cluding word-length and non-uniform instruction cost analyses in) work [BDPOSj 
which, applied to our setting, simplifies to the following observation: For < 
ao, . . . , aN-i,bo, . . . , Bn-i < 2*-\ let A := J^t'o^ ' 2", B b^ ■ 2", and 

C := 'Zt 2*"^ ■ 2**. Then 

yi^O,...,N~l: a^>b^ ^ {A + C - B) &iC = C . 

In particular, subject to the above encodings, "3i : ai = hi" can be tested in constant 
time over (-I-, Now such an encoding can be obtained for the double sequence 

{xi +yj)i+nj in linear time 0{n) over (+, — , x); cf. e.g. our proof of Observation [TBI 
below. □ 



2 Polynomial Evaluation 

occurs ubiquitously in computer science, e.g. in connection with splines or with 
Reed-Solomon codes. It is commonly performed by Horner's method using 0{d) 
arithmetic operations where d denotes the polynomial's degree. While this has been 
proven optimal in many cases [BCS97[ Theorem 6.5], over (certain subrings of) 
integers it is not. Specifically, if the integer polynomial to be evaluated has coeffi- 
cients which are small (e.g. only Os and Is) compared to its degree, Horner's method 
can be slightly accelerated: 

Proposition 2. Given po, . . . ,Pd-i G ^ with \pn\ < P G N and x e Z, X)n=oPn^" 
can be calculated using 0{d/ \ogp d) operations over (-I-, x). 

Proof. We treat the terms with negative coefficients separately and may therefore 
suppose pn > 0. For fc S N to be chosen later, decompose p into [d/fc] polynomials 
Qi € N[X] of degree less than k. Notice that, since their coefficients belong to 
{0, 1, . . . , P — 1}, there exist at most P'^ distinct such polynomials. Evaluate all 
of them at the given argument x G Z: P'' separate executions of Horner's method 
result in a total running time of 0{k ■ P*^); but dynamic programming achieves the 
same within 0{P^). In a second phase, apply Horner to evaluate Y^lt!^^ ■ 
aX Y = x^ and obtain p{x) as desired: Together this leads to a total number of 
operations of 0{d/k + P'') which, for k « logp d— logp logp d, becomes 0{d/ logp d) 
as claimed. □ 



2.1 Throwing in Integer Division 

The running time obtained in Proposition [2] is sublinear but still dependent on 
the degree d of the polynomial under consideration. For fixed p, [MST89 Bsho92| 
had observed that, surprisingly, this dependence vanishes when admitting integer 
division as operational primitive: 
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Algorithm 3. Fix X G N and an integer polynomial with nonnegative coefficients 
p G N[a;]. Then, for Z Cz N sufficiently large, one can evaluate {0, 1, . . . ,X} 3 x 
p{x) as follows: 

1) Input a; € {0, 1, . . .,X}. 

2) Compute Z'*+i div (Z - x) 

3) multiply the result to p{Z) 

4) integer- divide this in turn by Z'^ and 

5) output the remainder from division by Z . □ 

By pre-computing and storing the integers Z, Z"^, and p{Z) one arrives at 

Corollary 4. Over integer operations {+,— , Xc, div}, an arbitrary fixed polyno- 
mial p e Z[a;] can be evaluated on an arbitrary finite domain D C Z m constant 
time independent ofp and (of its degree and of) D. 

Here, X;. denotes unary multiplication (scaling) of the argument by a fixed integer 
constant. Indeed, evaluation of p at a negative argument —x reduces to the evalu- 
ation at positive x of p{—x); and every integer polynomial is the difference of two 
with nonnegative coefficients. 

Concerning the correctness of Algorithm [3l we repeat a proof due to [Bsho92| 
and obtain the following strengthening used in Section UJ 

Scholium 5. Let p — X]n=o^'"^" "-^ degree at most d and norm \\p\\i :— \po\ + 
• • • + \Pd\ ^ Pj then every Z > maxjA*^ • P, {X"^ + 1) • X} is 'sufficiently large' for 
Algorithm\^ to succeed. 

Proof It holds Z-^+i div(Z-a;) = - |)J = L^'' • Em=o W^)™J = 



\z'^ + z'^-^x-^--- + Zx'^~^ + x'^ + V z'^ix/zyi 



= Z^ + Z^-'x + --- + Zx''-^+x'' -x''+V(2--)<i 
since Z > (a;'^ + 1) • x by prerequisite. Hence the result of Step 3 equals 



Ed 
n,r. 



,m=0 -"^ — 'fc=0 

where < qk < Z ior k < d because Z > x'^ ■ {po + • • • + pd)- In particular 
Qd = X]n=oP" ' ^" ~ ^'(■^) isolated by Steps 4 and 5. □ 



2.2 First Consequences 

Corollary 6. Over integer operations {+, — , x c, div}, every finite integer sequence 
yo,yi, ■ ■ ■ ,yN (or, more formally, the mapping {0,1,..., A^} B n i-^ y„) is com- 
putable in constant time independent of (the length Nof) the sequence! 

Proof. Consider an interpolation polynomial p S Q[A] of degree < + 1 such that 
p{n) = n = 0, . . . , iV. Take M G N such that M - p £ Z[X]. Apply Corollary H 
in order to calculate n i-^ M ■ p(n) in constant time, then integer-divide the result 
by M. □ 

It has been shown in |JMW89| that every language L C Z (rather than Z*) which 
can be decided over — , Xc, div} at all, can be decided in constantly many steps; 
that is in time independent of the input x G Z — but of course depending on L. 

Observation 7. Every finite language L (- 1, is decidable over integer operations 
{+,— , Xc,div} within constant time independent of L. 

Proof. Let L C {0, 1, ... , N} and apply Corollary [S] to the characteristic sequence 
{yo, . . . , j/at) of L, defined by j/„ 1 for n £ L and y„ := for n ^ L. □ 

The next subsection implies the same to hold for finite sequences (yo, at) in Z"* 
and for finite languages L C Z'' as long as d is fixed. 
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2.3 Multi-Variate Case 

We extend Algorithm [3] to obtain 

Proposition 8. Over integer operations {+,— , x,div}, any fixed polynomial p G 
Z[xi,. . . ,Xn\ can be evaluated on an arbitrary finite domain D C Z" in time 0{n) 
independent of p and X . 

Proof. We devise 2" separate algorithms: one for each of the polynomials p(±a;i, ±X2, . . . , ±a;„) 
to be evaluated at non-negative argument vectors x € N". Then, for a given input 
in Z", one can in time 0{n) determine which of these polynomials to evaluate at 
|a;2|, • ■ • , \xn\) in order to yield the aimed value p{x). Moreover decomposition 
of a polynomial into a part with positive and one with negative coefficients reduces 
to the case p = Ylt^ i„=o ^ii,---,in ■ ■ ■ ■ x^n with £ N. 

As in Equation ^ on pH Z'^ d±v{Z - x) equals Z'^-'^ + Z*^-^ • a; H \- Z ■ 

x'^^'^ + x"^^^ for all integers Z > Q{x''-). Applied to X2 and Z2 := Z''-^ one obtains 

[Z'^' d±v{Z'^ - X2)) ■ {Z^ div(Z - Xi)) = Yf-'^ _^ z'^'-l-{d^2+^l) . ^d^2 . 

and inductively, using 0{n) operations from {+,— , x,div}. 
Then multiply this counterpart to Step 2) in Algorithm [3] with the constant 



E"* ^ 2;d"-l-{d"-h„ + ---+di2+ii) . 

il ,....i„=0 



p{Z,Z'^,Z'^\...,Z'^" ') = ^ a^, Z''+'^'^+'^'''+-+'^" 

^ — 'il ,...,in—0 ' ' " 

(cmp. Step 3) and extract the term corresponding to Z''""^ (Steps 4+5). 



2.4 Evaluation on all integers: Exploiting Bitwise Conjunction 

As opposed to Horner's method. Algorithm [3] and its above generalization restricts 
polynomial evaluation to arguments x from an arbitrary yet finite domain. Indeed 
Scholium [5] derives from a bound X on a; one on Z to avoid spill-overs in the Z-ary 
expansion of the product of Z"^^^ div{Z — a;) with p{Z). Now Z can of course be 
chosen adaptively with respect to x, but how do we then adapt and calculate p{Z) 
accordingly? This becomes possible when allowing, in addition to integer division, 
bitwise conjunction as operational primitive. 

Proposition 9. Fix p e Z[x] of degree d. Then evaluation Z 9 x 1— > p{x) is 
possible using O (log d) operations over {+,—, x , div, 

This is much faster than Horner and asymptotically optimal. 

Proof. As usual we may presume both p's coefficients po, . . . ,pd and x to be non- 
negative. Moreover, since p is fixed, one may store p{Y) as a constant for some 
sufficiently large integer Y, w.l.o.g. a power of two. Notice that Y —1 can then serve 
as a mask for bitwise conjunction: for < Qn < Y and Z a multiple of Y, it holds 

( E„ 9" • ^ ((^ - 1) ■ ^'") = 9™ • ^™ ; 

compare Figure [T] Now given x G N we compute, using repeated squaring within 
O{\ogd), Z' := x'^^'^] hence Z :— Z' ■ Y satisfies the conditions of Scholium [5] 
Then, using another 0{\ogd) steps, calculate Z''^^^ and, from that, X]i=o ^" ~ 
^/d+i div(Z' - 1) as in Equation (j]). Multiply the latter to p{Y) and, to the result. 
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Fig. 1. Expansions of the calculations employed in the proof of Proposition [9] 



apply bitwise conjunction with J2i=oO^ ^ ^) ' (■^'^)^ the latter can be obtained 
again as (r - 1) • ((Z'^'+i • y'+i) div(Z' • r - 1)). Based on the mask property of 

Y — 1 mentioned above, this yields X)iLo?'» ' {Z'Yy = p{Z): now continue as in 
Algorithm [31 □ 

A review of the above proof reveals that the ©(logd) steps are spent for calculating 
Z' = a;**"*"^ and Z"^; everything else proceeds in constant time based on pre-computed 
constants like y^. Now when x < 0{2'^), x'^ and Z'^ are faster to obtain starting 
the repeated squaring, rather than from a;, from 2'^ and 2*^ for O(loglogx) steps, 
respectively. Alternatively we may choose d as a power of two to invoke Example[T]D) 
and arrive at 

Scholium 10. Fix p e Z[a;] of degree d. Given a; £ Z, one can calculate p{x) using 
O(loglog |a;|) operations over {+, — , x , div, If in addition some arbitrary integer 
y > \x\'^ is given, also running time 0{y/mm{\ogd, loglog |x|}) is feasible. 

As in Proposition [51 this extends to the multi-variate case: 

Theorem 11. Over integer operations {+, — , x , div, any fixed polynomial p € 
Z[a;i, . . . , x„] of maximum degree less than d can be evaluated in time 
0(n ■ minjlogd, loglog max^ |a^i|}). 

//, in addition to the argument {xi, . . . ,Xd), some integer y > (max^ |a;i|)'' ^ is 
given, the running time reduces to 0{n ■ -y/minjlogd, loglog max^ |a;i|}). 

Proof. According to the proof of Proposition [H for some integer Z > Q{x'^ ), 
we need to know (Z'^, Z'^\ . . . , Z'^") and p{Z, Z'^,..., Z'^"''). Since the latter is a 
wnivariate polynomial in Z of degree < c?"^^, the proof of Proposition [51 shows how 
to obtain this value from p{Y, Y'^, . . . , Y"^ ) using bitwise conjunction. Repeated 
squaring, either of max, \x^\ or of (2'*, 2'^'' , . . . , 2'^"), yields {Z'^, Z'^\ . . . , Z*^") in time 
0{n ■ minjlogd, loglog maxi |a;i|}); the additional input y accelerates this to 0{n ■ 
i/minj log d, loglog maxi \xi\}) according to Example [Da) . □ 

2.5 Storing and Extracting Algebraic Numbers 

When permitting not but only (+, — , x div), Horner's seems to remain the 
fastest known algorithm for evaluating an arbitrary but fixed polynomial on entire 
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N. Its running time 0{d) leaves a doubly exponential gap to the lower bound of 
l2(loglogd) due to [LBMH93[ Corollary 3]. 

Question 12. Does every (fixed) polynomial p e N[a;] admit evaluation x i— > p{x) on 
all integers x G N in time o(degp) over (+, — , x, div) ? 

In view of the previous considerations, the answer is positive if one can, from given 
X within the requested time bounds and using the operations under consideration, 
obtain the number p{Z) for some Z > fl{x'^) where d > degp. To this end in 
turn, choose Zn := Y ■ 2" where F = 2'^ > ||p||i and encode the sequence p{Zn) < 
Zn ■ \\p\\i < 2^+'^"', where n e N and K -.^ k ■ {d + 1), into the binary expansion of 
a real number like 

Pp V p(Z„) • 2-"-(^+'^") . (2) 

^ — 

Then, given a; G N, it suffices to approximate pp up to error e < 2^^"^''" for some 
n > Q{d ■ log a;) in order to extraclu the corresponding p(Z„). 

Lemma 13. Fix a G R algebraic of degree < S. Then, given n G one can 
calculate u,w G N such that \a — u/v\ < 2^" using 0{6 ■ logn) operations over 
(+,-,x). 

Proof (Sketch). Apply Newton Iteration to the minimal polynomial q G Z[x] of 
a. Since the latter is fixed, q, q' , and an appropriate starting point for quadratic 
convergence can be stored beforehand. 0{logn) iterations are sufficient to attain 
the desired precision; and each one amounts to evaluating q and q' at cost 0{S) via 
Horner. □ 

So when permitting a mild dependence of the running time on x and if pp is algebraic 
of degree o(degp), we obtain a positive answer to Question 1121 

Proposition 14. Let p G N[x] be of degree < d and suppose that 2^^*" is 
algebraic of degree < S. Then N 3 x i-^ p{x) can be calculated over (+,— , x,div) 
using 0{5 ■ loglogx) steps. 

Unfortunately the question whether Ylm 2^^^" is algebraic (not to mention what 
its degree is) constitutes a deep open problem in Number Theory [RibeOOi Sec- 
tion 10. 7. B, Example 1, p. 314]. 

We are currently pursuing a different approach to Question [T^ with a mild 
dependence on x: namely by exploiting integer division in some of the algorithms 
described in [FiducSS] in combination with the following 

Observation 15. Let p G ^\x\ be of degree < d and c G N. Then the integer se- 
quence p{l),p{c),p{c'^), . . . ,p(c"), ... is linearly recurrent of degree < d; that is there 
exist ao, . . . , a^-i, G Z such that p{c^^^) = (ai • p(c") + • • • + • p(c"~''+"'^))/ao 
for all n G N. 

Proof. For fc = d — 1, the [d + 1) polynomials p{cx) , p{x) , p{x / c) , . . . ,p{xc^'') all 
have degree < d and therefore must be linearly dependent over Q: qo ■ p{cx) + qi ■ 
p{x) + • • • + qk+i ■ p{xc^^) = 0; w.l.o.g. qi G Z. Choosing k minimal implies go 7^ 0- 

□ 



Strictly speaking, this approximation does not permit to determine e.g. the least bit of 
p{Z„) due to iterated carries of less significant ones; however this can be overcome by 
slightly modifying the encoding to force the least bit to be, e.g., zero. 
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3 Applications to Linear Algebra 

Naive multiplication of n x n matrices takes cubic running time, but V. Strassen 
has set off a race for faster methods with current record 0{n'^) for oj < 2.38 held 
by D. Coppersmith and S. Winograd; see |BCS971 Section 15] for a thor- 
ough account. However these considerations apply to the uniform cost model over 
arithmetic operations +, — , x where division provably does not help [BCS97[ The- 
orem 7.1]; whereas over Z when permitting integer division as a ?ion-arithmetic 
operation, optimal quadratic running time can easily be attained: 

Observation 16. Given A e Z^""" and B e Z"^™, one can compute C := A ■ B ^ 
jkx-m y^gijig 0(^{k + m) ■ nj operations over {+, — , x, div}. 
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Fig. 2. Encoding matrices (oi/) and (fo^.j) into integers a, (3 and decoding {ci_j) 
from a ■ [3 



Proof. We want to calculate Ci^j = X]"=i ' for i = 1, . . . ,k and j = 1, . . . , m. 
W.l.o.g. fli f , b^ j > 0; otherwise decompose. Choose Z > (max^^^ ai_f)-(max£ j bg j)-n; 
then compute 

k n n m 

" E E • and f3:^J2J2 ^^'^ ' . 

i=i e=i t=i j=i 

As indicated in Figure [21 the Z-adic expansion of their product 7 := a • /3 contains 
all desired numbers aj at 'position' ^2n(j-i)-i-(n-i)+2nm(i-i) fj.Qjjj which they are 
easily extracted using division with remainder. □ 

Observe that most of the time is spent encoding and decoding the input and output, 
respectively. However the right factor is encoded differently from the left one; hence 
binary powering yields computation of A'^ from A g Z"'*" within 0{n^ ■ \ogk) 
whereas a running time of 0{n^ -f-logfc), i.e. by encoding and decoding only at the 
beginning and the end, seems infeasible. We shall return to this topic in Section [3?2l 



3.1 Determinant and Permanent 

Over arithmetic operations — , x), the asymptotic complexities of matrix multi- 
plication and of determinant computation are arbitrarily close to each other jBCS971 
Section 16. 4]. The same turns out to hold as well when including integer division: 
not by means of reduction but by exhibiting explicit algorithms. They reveal also the 
permanent to be computable in information-theoretically optimal time — whereas 
over (+, -, x) it is Valiant NP-complete }BCS97| Theorem 21.17]. 

Proposition 17. Given A g Z"'*", one can calculate both det{A) and perm(^) 
within 0(71^) operations over {+,— , x,div}. 
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Notice that, as opposed to Theorem [TTl bitwise conjunction is not needed! 
Proof. Both 

det(yl) ^ sgn{TT)ai^^(^iy ■ a„^^(^„-) and perm(A) = ^ ai^^(i) • • • a„^^(„) 

are polynomials in variables Xi_i^n{j-i) ■= of maximum degree less than 
d :— 2 with coefhcients 0,±1. As in Section [2.41 it thus suffices, in view of the 
proof of Proposition [51 to (decompose det into one part with positive and one with 
negative coefficients and to) obtain their respective values at af = (xq, . . . , Xn^_i) := 

(Z', . . . , where Z' :^ Z ■ Y for Z := (maxfc |xfc|)2"' and Y denotes 

some appropriate constant. Now x can be computed in 0{n'^); and so can the 
quantity 

n-l 

TrGP i=0 ^ " ' -ireV ttEV 

equating to Card(7') • where 7r'(i) := 7r(i + 1) - 1 and 7^ denotes 

either 5„ (permanent. Card — n\) or one of {tt G 5„ : sgn(7r) = —1} or {tt G 5„ : 
sgn(7r) = +1} (determinant. Card = nl/2). □ 

Remark 18. Just recently have we been pointed out a more direct algorithm cal- 
culating PJ"^" 9 j4 i-^- perm(^) over (+,— , x,div) in 0{n^) steps inspired by the 
proof of [ABKM06[ Proposition 2.4]. 

3.2 Integer Matrix Powering: Exploiting GCD 

The unit cost assigned to multiplication "x" allows to compute high powers like 
of a given input a to be calculated by squaring fc-times. However the presence 
of integer division and the additional input of a sufRciently large but otherwise 
arbitrary integer b yields in only 0{^/k) steps; recall Example [Jd) . We now 
generalize this to powering integer matrices. 

Definition 19. For X,C e U^""^, let gcd(C) := gcd(cy ■ I < i,j < d) and 

X rem C :— {xij rem gcd(C)) extend the gcd and remainder from natural num- 
bers to integer matrices. 

Also write "X = Y (mod C) " i/gcd(C) divides each entry Xij — yij of X — Y . 

For fixed C, this obviously yields an equivalence relation on Z'*^''; in fact a two-sided 
congruence relatior{3, since one easily verifies: 

Lemma 20. a) If X = Y (mod C), then S ■ X -T = S -Y -T (mod C). 

b) For each n^mt holds X" = y" (mod X ~Y). 

c) X remC = X (mod C). 

d) IfO< x,j < gcd(C) then X rem C ^ X . 

Claim b) follows from the non-commutative binomial theorem in connection with 
a). 

Now apply Lemma [20| to n := 2^ X := A^'^' ^\y -.^ and C := Y - X in 
order to conclude 

= (A^ ) = rem [B ~ A^ ) (3) 

* conversely, every two-sided ideal in 2,'^'^'^ is of this form [Jaco64l Proposition III. 2.1] 
but we won't need this fact 
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provided that gcd(i3 ~ ') is larger than the entries of . In the case d = 1 

treated in Example [T]d), this amounts to the condition that B = (b) simply be 

.2 

larger than a . In the general case d > 1 , Section 13.31 below describes how to 
obtain matrices B appropriate for our 

Theorem 21. Given k E N, A E f^dxd. furthermore given some B G f^'^^d. g^f,]^ 
that, for all < c^- < d^ ■ (max^j aijY —'■ r, it holds gcd(i? — C) > r; then one 
can compute using 0{d^ ■ -y/log k) operations over {+,— , x,div, gcd}. 

Proof. It suffices to treat the case k — i'^. First calculate -B^* using repeated squaring 
within 0{d^ ■ £) according to Observation [TBI Then proceed, inductively for j — 
!,...,£, from A^*'^ ^' to A^"' according to Equation ([3]), again at cost 0{d'^ ■ i) 
each. Indeed, the m-th power C of a d x d-matrix A with entries in {0, 1, . . . , s} has 
entries in {0, 1, ... , d"^~^ ■ s"*}; hence the prerequisite on B shows that Lemma [^OH) 
applies. □ 

The binary gcd operation is used to compute gcd(C) in 0{d'^) steps and then 
X rem C according to Definition [121 In fact we were surprised to realize that the 
above sequence A^ , j = 0, . . . , /c, is obtained according to Equation ([3]) merely by 
componentwise remainder calculations. 



3.3 Locally Lower-Bounding the GCD 

Upper semi-continuity of a real function / : M.'^ — > M at x means that, for arguments 
u sufSciently close to x, the values f{u) do not drop below f{x) too much; recall e.g. 
|Rand68l Chapter 6.7]. Now the greatest common divisor (gcd) function is discrete 
and such topological concepts hence inapplicable in the strict sense. Nevertheless 
one may say that gcd does admit points x arbitrarily close to 'approximate' upper 
semi-continuity: 

Lemma 22. To d, r, s G N there exist xi,X2, ■ ■ ■ ,Xd G N such that, for allvi, . . . ,Vd G 
{0, 1, . . . , s — 1}, it holds gcd(xi + vi, . . . , Xd + Vd) > r. 

Proof. Take pairwise coprime integers p^j > r, v E {0,1, . . . , s — 1}''; e.g. distinct 
prime numbers will do. For i = 1, . . . , c? and j = 0, . . . , s — 1 let Uij :— Yiv v =j P'"- 
Then, for fixed i, the numbers Ui^, Ui^i, . . . , Ui,s_i are pairwise coprime themselves. 
Hence, by the Chinese Remainder Theorem, there exists Xi E N such that Uij 
divides -I- j for all j = 0, 1 , . . . , s — 1 . In particular p^, which is common to all u^^^ . , 
divides Xi + Ui for each i = 1, . . . ,d; and thus also divides gcd(a;i + vi, . . . , Xd + Vd): 
which therefore must be at least as large as > r. □ 

Scholium 23. a) Xi, . . . ,Xd according to Lemma l^ can be chosen to lie between 

and (r • 5)'^^^) where S :== s'K 
b) It can be constructed (although not necessarily within this bound) using 0{S) 
operations over (+, — , x , div, gcdex). 

Here "gcdex" denotes the extended (binary) gcd function returning, for given x,y E 
N, s,t eTL (w.l.o.g. coprime) such that gcd(a;, y) = sa; + ty. 

Proof, a) According to the Prime Number Theorem, the fc-th prime pk has mag- 
nitude 0(k ■ logfc) and there are at most 7r(n) < ©(n/logn) primes below n. 
Hence the first prime at least as large as r has index fc^ < 0(r j logr); and we 
are interested in bounding the product N = p^^ ■ ■ ■pk^.+Sj that is basically the 
quotient of primorials (r + £)i^/r^ where r + £ — Pk^+s = r + {S ■ log5). It has 
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been showi:|§ jMoVa73] that 7r(r + i) — 7r(r) < 2tt{£) holds; that is, between r 
and r + £ there are at most 0{£/ logi) = 0{S) primes; and each obviously not 
larger than r + £. Hence (r + ^)#/r# < (r + £)0(^/i°g^) < (r ■ i)0{e/ log e) fo^, 
e = 0{S ■ log S). 

b) Pairwise coprimc integers Pi > r can be found iteratively as pi :— r, p2 := r + 1, 
P3 :— pi ■ P2 + 1, and Pi+i :~ pi - ■ -pi + l. Then apply the next lemma. □ 

Lemma 24 (Chinese Remainder). Given integers ai,...,a„ and pairwise co- 
prime mi, . . . , m„ g N, one can calculate x G N wit/i a; = (mod mi) for i — 
1, . . . , n mi/i 0(logn • X]"=i logm^) operations over (+, — , x , div). When permit- 
ting in addition gcdex as primitive, the running time drops down to 0{n). 

Proof. Calculate N := mi • • • m„ and, for each i = 1, . . . , n, :— tiN /mi where 
1 — gcd{mi,N/mi) — Simi -\-tiN/mi with Si,ti e Z returned by gcdex. Then it 
holds Ci = 1 (mod m^) and = (mod mj) for j ^ i; hence x :— ei ■ ■ ■ e„ satisfies 
the requirements. 

When working over (+,— ,x,div), the extended Euclidean algorithm computes 
gcdex(mi, iV/mi) within 0{\ogN) — ^(X^jlog™^) steps, for each i = l,...,n 
separately: leading to a total running time of 0{n ■ log A^). In order to improve this 
with respect to n, arrange the equations "a; = Ui (mod mi)" , i — 1, . . . ,n, into a 
binary tree: first compute simultaneous solutions to y = 02 j (mod m2j) and y = 
a2j+i (mod m2j+i) for j — 1, . . . ,n/2; then solve adjacent quadruples as a; = y2j 
(mod ■ m4j_|_i) and x = 2/2j+i (mod m4j^2 • "14^+3) for j — 1, . . . ,n/4; and so 
on. The k-th level thus consists of solving n/2'' separate /c-tuples of congruences 
involving disjoint fc-tuples out of mi, . . . , m„; that is, the corresponding extended 
Euclidean algorithms incur cost 0{J2i log™i) independent of A; = 1, ... , C'(logn). 

□ 

3.4 Constructing Primes Using Integer Division 

The (last of the) S pairwise coprime numbers pj > r (and thus also the integers 
Xi) computed according to Part b) of Scholium [23] are of order /?(r^^ ) and thus 
much much larger than the ones asserted feasible in Part a) by choosing pj as prime 
numbers. This raises the question on the benefit of our non- arithmetic operations 
for calculating primes, i.e. for solving Problem (b) mentioned in the beginning of 
|Ribe96[ Chapter 3] and addressed in Section II thereof. 

The Sieve of Eratosthenes finds all primes up to N using 0{N) operations over 
(+, — ). This can be accelerated |Prit81 j to 0{N / loglog A^); which is almost optimal 
in view of the output consisting of 0{N / \ogN) primes according to the Prime 
Number Theorem. This also yields a simple randomized way of finding a prime: 

Observation 25. Given e N, guess some integer N < n < 2N. Then, with 
probability 0(1/ log N), n is a prime number: hence after 0{logN) independent 
trials we have, with constant probability, found one. Using Example [TJaj to test 
primality, this leads to 0{\og^ N) expected steps over (+,—, x,div). 

Indeed the Bertrand-Chebyshev Theorem asserts a prime to always exist between 
N and 2A^. This trivial algorithm can be slightly improved: 

Proposition 26. Given N € N, a randomized algorithm can, at constant probabil- 
ity and within 0(log^ N/ loglog A'^) .steps over (+, — , x , div), obtain a prime p > N . 



We are grateful to our colleague Stefan Wehmeier for pointing out this bound! 
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Proof. First check whether N itself is prime: by testing whether N divides {N — 1)1 
(Wilson's Theorem); using integer division, this can be done in ©(log A^) operations 
over (+,— , x,div) |Sham791 Section 3]. From that, each adjacent factorial (A^ + 
k)\, k ^ Q, . . . , K — 1, is reachable in constant time: that is, after having tested 
primality of N, corresponding checks for N + 1,N + 2,...,...,N + K are basically 
free when K := ©(logiV). 

So now guess some ©(log 7V)-bit number M < N and then test integers N + 
M, N + M + 1, . . . , N + M + K in total time ©(log N) as above. We claim that this 
succeeds with probability ]7(loglog A^/log A^), hence the claim follows by repeating 
independently at random for ©(log A^/ loglog A^) times. Indeed the Prime Number 
Theorem asserts between N and 2N to lie f2{N/ log N) primes; on the other hand 
every interval of length K between A^ and 2N contains at most tt{K) < 0{K/ log K) 
primes |MoVa73] : hence by pigeon hole, among these N/K intervals, a fraction of 
at least i7(log-ft'/log A^) must contain at least one prime. □ 

Concerning an even faster and deterministic way of constructing primes, we 

Remark 27. In 1947, W.H. Mills proved the existence of a real number 9 « 
1.3063789 |CaCh05| such that p„ := [0^" \ , n G N, yields a (sub-) sequence of primes 
with Pn+i > Pn- It is not known whether 9 is rational; if it is, one can straight- 
forwardly extract from 6 a prime Pn > 3" N within 0{n) = ©(log A^) steps over 
(+,-, x,div). 

But even if 9 turns out as an algebraic irrational, then still we obtain the same 
time bounds! Indeed, in order to compute [9^ \ , 

{9 + e)^ = 9"" + N-e-9''-'+y2''_ f^^.^'^.e^-'^ 

k — 2 \ f£ J 

<i 

shows that it suffices to calculate a rational approximation 9' of 9 up to error 
e ss 2^^ /N, according to Lemma [T^ in time ©(log A?^), and to then take [9'^ \ . □ 

4 Practical Relevance 

Any real computer is of course far from able to operate in constant time on arbitrary 
large integers, the above algorithms therefore not practical in any way. Or are they? 
The technological progress described by Moore's Law over the last decades includes 
an exponential increase in the width of processors' arithmetical-logical units (ALU). 
Indeed, nowadays CPUs can commonly operate on 64 or even on 128 bits in one 
single instruction: that is, the unit cost model is valid for surprisingly large inputs 
and likely to become valid for even larger ones. 
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Fig. 3. Polynomials and argument ranges for Algorithm [3] to work on x86-64 CPUs 



Specifically concerning Algorithm [3l it already now covers many polynomials of 
degree up to five (i.e. with six free coefhcients): one can easily see the largest inter- 
mediate result to arise from the multiplication in Step 3; which then gets integer- 
divided (and thus smaller again) in Step 4. This corresponds rather nicely to two 
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instructions provided by systems like AMD64 [AMD07[ Section 3.3.6] and Intel64 
|Intel071 Section 3.2]: mulq multiplies two 64-bit unsigned integers to return a full 
128-bit result; while divq obtains both quotient and remainder of dividing a 128-bit 
numerator by a 64-bit denominator. So whenever, in addition to the conditions on 
Z imposed by Scholium [5l < 2^"* holds, each step of Algorithm [3] translates 

straight-forwardly to one x86-64 instruction. Figure [3] lists some example classes of 
polynomial^ and argument ranges which comply with these constraint; the shaded 
areas indicate that Z can be chosen as a power of 2 to further replace the integer 
divisions in Steps 4 and 5 by a shift and a binary conjunction, respectively. This 
leads to the realization indicated in the left of Figure U) 



# Input : Z — x in /.rsi 
#Constants : 

# p{Z) in "/.rdi. 

# in 7.rdx:y.rax 

# (may occupy >64bit) 

# Z — 1 in y.ebx , 

# 64 - c/ ■ \og.^{Z) in "/.cl 
divq y.rsi 

mulq y.rdi 
shld y.cl , y.rax , y.rdx 
andl y.ebx , y.edx 
#Output : p{x) in y,edx, 

# y,rax destroyed. 



Fig. 4. X86-64 GNU assembler realization 



# X (byte!) in y.eax = Xecx; 

# d+1 coeff bytes in (y,esi) 
mulb (y.esi) 

xorl yebx.y.ebx 
movb 1 (y.esi) //.bl 
addl y.ebx , y.eax 
mull y.ecx 

mull y.ecx 

movb d(y.esi) //.bl 

addl y.ebx, y.eax 

# p{x) in y.eax, 

# y.ecx y.ebx y.edx destroyed 



of Algorithm [3] and of Horner's method 



In comparison with Horner's Method depicted to the right, this amounts to 
essentially the elimination of c? — 1 (out of d) multiplications at the expense of one 
division — in a sense a counter-part to the converse direction, taken e.g. in |GrMo94] . 
of replacing integer divisions by multiplications. 

Now an actual performance prediction, and even a meaningful experimental 
evaluation, is difficult in the age of caching hierarchies and speculative execution. 
For instance (e.g. traditional) 32-bit applications may leave large parts of a modern 
superscalar CPU's 64-bit ALU essentially idle, in which case the left part of FigureU] 
as a separate (hyper-)thread can execute basically for free. 

However even shorter than both Horner's and Bshouty's Algorithm for the eval- 
uation of a fixed polynomial p is one (!) simple lookup in a pre-computed table 
storing p's values for a; = 0,l,...,X. On the other hand when there are many poly- 
nomials to be evaluated, the tables required in this approach may become pretty 
large; e.g. in case of d = 3, X = 21, and ||p||i < 56 (right-most column in Figurc[3]), 
the values of p{x) reach up to X'^ ■ \\p\\\, hence do not fit into 16 bit and thus 
occupy a total of (X + 1) x 4 bytes for each of the = 487,635 possible 

polynomials p: far too much to be held in cache and thus prone to considerably stall 
a nowadays computer; whereas the 487, 635 possible 64-bit values p{Z) do fit nicely 
into the 4MB L2-cache of modern CPUs, the corresponding four byte coefficients 
per polynomial (cf. right part of Figure |4|) even into 2MB. One may therefore regard 
Algorithm [3] as a compromise between table-lookup and Horner's Method. 



polynomials of higher degree D can be treated as \D/{d -f 1)] polynomials of degree 
< d and then applying Horner's method to x'^'^^ 
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5 Conclusion 

We presented algorithms which, using integer division and related non-arithmetic 
operations like bitwise conjunction or greatest common divisor, accelerate polyno- 
mial evaluation, linear algebra, and number-theoretic calculations to optimal run- 
ning times. Several solutions would depend on deep open number-theoretical hy- 
potheses, showing that corresponding lower bounds are probably quite difficult to 
obtain. Other problem turned out as solvable surprisingly fast (and actually beating 
information-theoretical lower bounds) when providing some more or less generic 
integers as additional input. 

On the other hand, these large numbers would suffice to be of size 'only' doubly 
exponential and thus quickly computable when permitting Icftshifts ?y 2^ or, 
more generally, exponentiation Nx N 9 (x, y) i— > as primitive at unit cost. In view 
of the hierarchy "addition, multiplication, exponentiation" , it seems interesting to 
gauge the benefit of level £ of Ackermann's function A{£, •) to seemingly unrelated 
natural problems over integers. 
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